Learning Video Object Segmentation with Visual Memory
This paper addresses the task of segmenting moving objects in unconstrained
videos. We introduce a novel two-stream neural network with an explicit memory
module to achieve this. The two streams of the network encode spatial and
temporal features of a video sequence, respectively, while the memory module
captures the evolution of objects over time. The module that builds a "visual
memory" in video, i.e., a joint representation of all the video frames, is
realized with a convolutional recurrent unit learned from a small number of
training video sequences. Given a video frame as input, our approach assigns
each pixel an object or background label based on the learned spatio-temporal
features as well as the "visual memory" specific to the video, acquired
automatically without any manually-annotated frames. The visual memory is
implemented with convolutional gated recurrent units, which allow spatial
information to be propagated over time. We evaluate our method extensively on two
benchmarks, DAVIS and Freiburg-Berkeley motion segmentation datasets, and show
state-of-the-art results. For example, our approach outperforms the top method
on the DAVIS dataset by nearly 6%. We also provide an extensive ablative
analysis to investigate the influence of each component in the proposed
framework.
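As a rough illustration of the memory mechanism described above, the following Python/PyTorch sketch shows a convolutional GRU cell: the gates are computed with 2D convolutions, so the hidden state remains a spatial feature map that can carry the "visual memory" across frames. Module names, channel sizes, and the toy loop are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: gates are 2D convolutions, so the hidden
    state keeps its spatial layout (a C x H x W memory map)."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        # update (z) and reset (r) gates, computed jointly from [input, hidden]
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)
        # candidate hidden state
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)

    def forward(self, x, h):
        # x: (B, in_ch, H, W) per-frame features; h: (B, hid_ch, H, W) memory
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde  # updated spatial memory

# Toy usage: propagate the memory over a clip of per-frame features.
cell = ConvGRUCell(in_ch=64, hid_ch=64)
feats = torch.randn(8, 4, 64, 32, 32)   # (T, B, C, H, W)
h = torch.zeros(4, 64, 32, 32)
for t in range(feats.shape[0]):
    h = cell(feats[t], h)               # h accumulates the visual memory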
Breaking the "Object" in Video Object Segmentation
The appearance of an object can be fleeting when it transforms. As eggs are
broken or paper is torn, their color, shape and texture can change
dramatically, preserving virtually nothing of the original except for the
identity itself. Yet, this important phenomenon is largely absent from existing
video object segmentation (VOS) benchmarks. In this work, we close the gap by
collecting a new dataset for Video Object Segmentation under Transformations
(VOST). It consists of more than 700 high-resolution videos, captured in
diverse environments, which are 20 seconds long on average and densely labeled
with instance masks. A careful, multi-step approach is adopted to ensure that
these videos focus on complex object transformations, capturing their full
temporal extent. We then extensively evaluate state-of-the-art VOS methods and
make a number of important discoveries. In particular, we show that existing
methods struggle when applied to this novel task and that their main limitation
lies in over-reliance on static appearance cues. This motivates us to propose a
few modifications to the top-performing baseline that improve its capabilities
by better modeling spatio-temporal information. More broadly, we hope to
stimulate discussion on learning more robust video object representations.
Adaptive Temporal Encoding Network for Video Instance-level Human Parsing
Beyond the existing single-person and multiple-person human parsing tasks in
static images, this paper makes the first attempt to investigate a more
realistic video instance-level human parsing that simultaneously segments out
each person instance and parses each instance into more fine-grained parts
(e.g., head, leg, dress). We introduce a novel Adaptive Temporal Encoding
Network (ATEN) that alternately performs temporal encoding among key frames
and flow-guided feature propagation for the consecutive frames between two
key frames. Specifically, ATEN first incorporates a Parsing-RCNN to produce the
instance-level parsing result for each key frame, which integrates both the
global human parsing and instance-level human segmentation into a unified
model. To balance accuracy and efficiency, the flow-guided feature
propagation is used to directly parse consecutive frames according to their
identified temporal consistency with key frames. On the other hand, ATEN
leverages convolutional gated recurrent units (convGRU) to exploit temporal
changes over a series of key frames, which are further used to facilitate
instance-level parsing at the frame level. By alternately performing direct feature
propagation between consistent frames and temporal encoding among key
frames, our ATEN achieves a good balance between frame-level accuracy and time
efficiency, a crucial trade-off in video object segmentation
research. To demonstrate the superiority of our ATEN, extensive experiments are
conducted on the most popular video segmentation benchmark (DAVIS) and a newly
collected Video Instance-level Parsing (VIP) dataset, which is the first video
instance-level human parsing dataset, comprising 404 sequences and over 20k
frames with instance-level and pixel-wise annotations. To appear in ACM MM 2018.
Code link: https://github.com/HCPLab-SYSU/ATEN. Dataset link: http://sysu-hcp.net/li
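The flow-guided feature propagation used between key frames can be pictured as warping key-frame features along an estimated optical flow field. The PyTorch helper below is a sketch under the assumption that a separate flow network supplies the displacement field; the function name and shapes are illustrative, not the authors' implementation.

import torch
import torch.nn.functional as F

def warp_features(key_feat, flow):
    """Warp key-frame features to the current frame along an optical flow
    field (flow-guided feature propagation, sketched).
    key_feat: (B, C, H, W) features of the nearest key frame
    flow:     (B, 2, H, W) displacement from the current frame back to the key frame
    """
    B, _, H, W = key_feat.shape
    # base sampling grid in pixel coordinates
    ys, xs = torch.meshgrid(
        torch.arange(H, device=key_feat.device, dtype=key_feat.dtype),
        torch.arange(W, device=key_feat.device, dtype=key_feat.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # normalize to [-1, 1] as expected by grid_sample
    grid = torch.stack(
        [2.0 * grid_x / (W - 1) - 1.0, 2.0 * grid_y / (H - 1) - 1.0], dim=-1
    )
    return F.grid_sample(key_feat, grid, align_corners=True)

# Toy usage: reuse key-frame features for an in-between frame.
key_feat = torch.randn(2, 256, 48, 48)
flow = torch.randn(2, 2, 48, 48)        # in practice, predicted by a flow network
cur_feat = warp_features(key_feat, flow)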
Object Permanence Emerges in a Random Walk along Memory
This paper proposes a self-supervised objective for learning representations
that localize objects under occlusion - a property known as object permanence.
A central question is the choice of learning signal in cases of total
occlusion. Rather than directly supervising the locations of invisible objects,
we propose a self-supervised objective that requires neither human annotation,
nor assumptions about object dynamics. We show that object permanence can
emerge by optimizing for temporal coherence of memory: we fit a Markov walk
along a space-time graph of memories, where the states in each time step are
non-Markovian features from a sequence encoder. This leads to a memory
representation that stores occluded objects and predicts their motion, to
better localize them. The resulting model outperforms existing approaches on
several datasets of increasing complexity and realism, despite requiring
minimal supervision and assumptions, and hence being broadly applicable.
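The temporal-coherence objective can be pictured as a contrastive random walk over per-frame node features: transition probabilities are softmax-normalized affinities, and a forward-backward (palindrome) walk is trained to return each node to itself. The sketch below is a schematic of that idea; the paper's exact node construction and loss may differ.

import torch
import torch.nn.functional as F

def transition(a, b, temperature=0.07):
    """Row-stochastic transition matrix between two sets of node features.
    a: (N, D), b: (M, D) L2-normalized embeddings; returns (N, M)."""
    return F.softmax(a @ b.t() / temperature, dim=-1)

def walk_loss(node_feats):
    """Temporal-coherence objective, sketched: walk forward through the
    sequence and back, and ask every node to return to where it started.
    node_feats: list of (N, D) per-frame node embeddings (e.g. memory states)."""
    feats = [F.normalize(f, dim=-1) for f in node_feats]
    path = feats + feats[-2::-1]                 # palindrome: t0 ... tK ... t0
    walk = transition(path[0], path[1])
    for a, b in zip(path[1:-1], path[2:]):
        walk = walk @ transition(a, b)           # compose the Markov chain
    targets = torch.arange(walk.shape[0], device=walk.device)
    # negative log-probability of returning to the starting node
    return F.nll_loss(torch.log(walk + 1e-8), targets)

# Toy usage on random node features for a 4-frame clip.
loss = walk_loss([torch.randn(16, 128) for _ in range(4)])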
Object Discovery from Motion-Guided Tokens
Object discovery -- separating objects from the background without manual
labels -- is a fundamental open challenge in computer vision. Previous methods
struggle to go beyond clustering of low-level cues, whether handcrafted (e.g.,
color, texture) or learned (e.g., from auto-encoders). In this work, we augment
the auto-encoder representation learning framework with two key components:
motion-guidance and mid-level feature tokenization. Although both have been
separately investigated, we introduce a new transformer decoder showing that
their benefits can compound thanks to motion-guided vector quantization. We
show that our architecture effectively leverages the synergy between motion and
tokenization, improving upon the state of the art on both synthetic and real
datasets. Our approach enables the emergence of interpretable object-specific
mid-level features, demonstrating the benefits of motion-guidance (no labeling)
and quantization (interpretability, memory efficiency).
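The tokenization step can be illustrated with a standard vector-quantization bottleneck: each mid-level feature is snapped to its nearest codebook entry, with gradients passed straight through. Note that in the paper the tokens are additionally motion-guided; the PyTorch sketch below shows only generic VQ mechanics, with illustrative names and sizes.

import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Plain VQ bottleneck: snap each token to its nearest codebook entry and
    pass gradients straight through. Motion guidance is omitted in this sketch."""
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):
        # z: (N, dim) mid-level feature tokens
        d = torch.cdist(z, self.codebook.weight)   # distances to all codes
        idx = d.argmin(dim=-1)                     # discrete token ids
        z_q = self.codebook(idx)                   # quantized tokens
        # codebook + commitment losses (VQ-VAE style)
        loss = ((z_q - z.detach()) ** 2).mean() \
             + self.beta * ((z_q.detach() - z) ** 2).mean()
        z_q = z + (z_q - z).detach()               # straight-through estimator
        return z_q, idx, loss

# Toy usage
vq = VectorQuantizer()
tokens, ids, vq_loss = vq(torch.randn(100, 64))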